Transforming data - dplyr

In most of the cases you will need to change the format of your dataset because it will not be in the right format that you will need to plot or analyse the data. tidyverse has a package called dplyr which contains functions that help you to select columns/rows, sort, combine and perform other data types of data transformations. In the next sections we will look at different ways to transform our dataset. Now that we already know the basics of visualising data with ggplot we will also learn how to visualise other plots with the transformed dataset as we go along.

To learn more about dplyr please look at the following resources:

If you find these resources difficult to understand, return to these after completing the course.


Pipes

Before we move onto the different ways that we can transform our data, we’ll discuss one of the most powerful additions to R, pipes.

Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr.

Let’s say we are interested only in the data that do not contain any missing values. Using pipes, we do the following:

surveys %>%
  drop_na()

This takes the argument on the left (surveys) and passes it on to the function after it (drop_na, which removes all rows with missing values).

If we want to create a new object with the transformed data we can assign it a new name as below:

surveys_complete <- surveys %>%
                  drop_na()

surveys_complete

Although at this point pipes might not seem like a revolutionary invention, they become more powerful when combining multiple operations. This we’ll see next.

Selecting columns

To select columns of a data frame or tibble, use theselect function. The first argument is the data frame or tibble you are working on (in our example it is surveys, which we pipe through), and the subsequent arguments are the columns to keep.

# Extract species_id, weight, hindfoot_lenth, year and sex columns from surveys dataset.
surveys %>% 
  select(species_id, weight, hindfoot_length, year, sex)

To select all columns except certain ones, put a - in front of the column to exclude it.

# Select all columns of the surveys dataset apart from record_id and species_id columns.
surveys %>%
  select(-record_id, -species_id)


Filtering rows

To remove rows from a data frame or tibble use the filter function from the dplyr package. The first argument is the data frame or tibble to perform the filtering on (in this case we pipe the data through) and the next arguments are the conditions on which to keep the rows.

Filtering rows by values

To choose rows based on a specific condition, use the filter function as follows:

# Keep only the observations of animals collected from 1995 onwards from the surveys dataset.
surveys %>% 
  filter(year >= 1995)

You can filter on multiple conditions:

# Keep only the observations of animals collected from 1995 onwards
# that are female from the surveys dataset.
surveys %>% 
  filter(year >=1995,
         sex == "F")

To quote or not to quote?

The sex column is a character and thus needs to be quoted, whereas the year column is numerical and does not. Also note that the filter arguments could have been written on a single line, but it is useful to break up your code sometimes to make it more readable.


Challenge - pipes

Subset the surveys_complete data to keep only the species_id, weight, hindfoot_length, year and sex columns and the animals collected on and after 1995. Then plot a scatter plot of weight (x-axis) against hindfoot_length (y-axis) using this transformed dataset. Do all the above using pipes, without creating any variables.

Answer
surveys_complete %>%
  # Select columns
  select(species_id, weight, hindfoot_length, year, sex) %>%
  # Filter rows
  filter(year >= 1995) %>%
  # Plot transformed data
  ggplot(mapping = aes(x = weight, y = hindfoot_length)) +
    geom_point()


Challenge - plotting subset with different colour

Plot all the animals in the surveys_complete dataset as weight (x-axis) against hindfoot_length (y-axis). Use the dataset created above which contains only the animals that were collected on and after 1995 and highlight these points in red in the plot. Hint: create a new object called survey_recent that contains the dataset created above.

Answer
survey_recent <- surveys_complete %>%
                  select(species_id, weight, hindfoot_length, year, sex) %>%
                  filter(year >= 1995)
                  
ggplot(mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point(data = surveys_complete) +
  geom_point(data = survey_recent, colour = "red", alpha = 0.4)

Note: In the example above we did not specify the data argument in the ggplot function because the arguments in the ggplot function should be common throughout the whole plot. In this case we specified the respective data as separate layers in 2 different geom_point plots; the first plot is the data with all the animals in surveys_complete (points in black), the second plot is layered on top of the first plot and is a subset of the surveys_complete dataset (points in red).


Creating new columns

Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use the mutate function.

To create a new column of weight in kg:

surveys_complete %>%
  mutate(weight_kg = weight / 1000)

You can also create a second new column based on the first new column within the same call of mutate():

surveys_complete %>%
  mutate(weight_kg = weight / 1000,
         weight_lbs = weight_kg * 2.20462)

There are other ways on how to create new columns. Refer to the dplyr cheat sheet Make New Variables section.


Challenge - the boxplot

Using the surveys_complete dataset, create a boxplot for each year on the x-axis and weight in kg on the y-axis. Use the subset dataset that was transformed in the previous challenge, i.e. the one that has the recent observations (from 1995 onwards). As before try to do all the operations using pipes, without creating variables.
Hint: You will need to use the as_factor function to convert the year column to factor type.

Answer

Your first attempt would be to use the code we used before to subset the surveys_complete dataset, but also adding the mutate function to add a new weight_kg column. Further more you will need to also add the ggplot function to draw a boxplot:

surveys_complete %>%
  # Add weight_kg column
  mutate(weight_kg = weight / 1000) %>%
  # Select columns
  select(species_id, weight_kg, hindfoot_length, year, sex) %>%
  # Filter rows
  filter(year >= 1995) %>%
  # Plot transformed data
  ggplot(mapping = aes(x = year, y = weight_kg)) +
    geom_boxplot()
#> Warning: Continuous x aesthetic
#> ℹ did you forget `aes(group = ...)`?

You might notice that this code produces a warning message about the continuous data type of the x-axis and also creates a box plot with one single box and whisker averaged over all the years on the x-axis, rather than one for each year in the dataset as requested. As shown in the ggplot2 cheat sheet, box plots display a discrete variable on the x-axis and a continuous variable on the y-axis. At the moment year is treated as a continuous variable since its data type is double (you can see this from the tibble). To fix this, we need to convert the data type of the year column to a categorical one (factor) so that each year can be plotted as a separate box and whisker.

The forcats package that is part of tidyverse has the as_factor function which does just that; it converts a variable into a factor. See the updated code below, which uses the as_factor function:

surveys_complete %>%
  # Add weight_kg column
  mutate(weight_kg = weight / 1000) %>%
  # Select columns
  select(species_id, weight_kg, hindfoot_length, year, sex) %>%
  # Filter rows
  filter(year >= 1995) %>%
  # Convert the year column to a factor
  mutate(year = as_factor(year)) %>%
  # Plot transformed data
  ggplot(mapping = aes(x = year, y = weight_kg)) +
    geom_boxplot()

Notice that since we need to keep only observations that have been observed from 1995 onwards, we need to convert the year column to a factor after we filter the rows, as otherwise the row filtering operation will not be possible. This is because the new year column of type factor does not have any ordering in its categories so perfoming year >= 1995 would not be possible.

It is also possible to convert the year variable directly from within ggplot, thereby skipping the mutate step from above. We can do this using the updated code below (output is the same as above):

surveys_complete %>%
  # Add weight_kg column
  mutate(weight_kg = weight / 1000) %>%
  # Select columns
  select(species_id, weight_kg, hindfoot_length, year, sex) %>%
  # Filter rows
  filter(year >= 1995) %>%
  # Convert the year variable and plot transformed data
  ggplot(mapping = aes(x = as_factor(year), y = weight_kg)) +
    geom_boxplot()

Note on the boxplot

As mentioned the boxplot is used to visualise the distribution of a continuous variable. In the example above we displayed the distribution of weight over the different years. The figure below illustrates the different components of the boxplot. The bottom line of the box represents the 25th quantile, middle line the 50th quantile (or median) and the top line of the box is the 75th quantile. The whiskers can be thought of as defining the boundary of the confidence intervals of the distribution. The dots outside these whiskers are the outliers.


Sorting data

To sort your data dplyr provides function arrange.

# Sort weight in ascending order (default)
surveys_complete %>%
  arrange(weight)

To sort your data in descending order you will need to use desc().

# Sort weight in descending order 
surveys_complete %>%
  arrange(desc(weight))

You can sort your dataset based on the values of multiple columns:

# Sort weight in ascending order and hindfoot_length in descending order
surveys_complete %>%
  arrange(weight, desc(hindfoot_length))

As you can see from the result returned, the animals with the smallest weight are at the top. When there is a tie, i.e., more than one animal has the same weight, the animals are sorted in descending order of hindfoot_length. As you can see, the subset of animals with weight of 4 have been sorted in descending order based on hindfoot_length.


Summarising data

Creating summaries of your data would be a good way to start describing the variable you are working with. Summary statistics are a good example of how one can summarise data. We will not cover details about summary statistics in this course, but we will look at how we can summarise data in R.

Frequency - count

The most straightforward summary data is counting the the number of observations or frequencies. Frequencies are normally calculated when working with discrete variables that have a finite number of values, such as categorical data. In our surveys_complete dataset, let us obtain the frequecies of male and female animals present. We can do this by counting the number of “M” and “F” present in the dataset. To do this use the dplyr function count as follows:

  surveys_complete %>%
    count(sex)
#> # A tibble: 2 × 2
#>   sex       n
#>   <chr> <int>
#> 1 F     14520
#> 2 M     16031

As you can see count has grouped the categories present in the sex column and returned the frequency of each category. If we wanted to count combination of factors, such as sex and species, we would specify the first and the second factor as the arguments of count():

surveys_complete %>%
  count(sex, species) 
#> # A tibble: 41 × 3
#>    sex   species         n
#>    <chr> <chr>       <int>
#>  1 F     albigula      606
#>  2 F     baileyi      1593
#>  3 F     eremicus      539
#>  4 F     flavus        711
#>  5 F     fulvescens     55
#>  6 F     fulviventer    15
#>  7 F     hispidus       91
#>  8 F     leucogaster   436
#>  9 F     leucopus       16
#> 10 F     maniculatus   354
#> # ℹ 31 more rows

Challenge - count data

  1. How many animals were observed in each plot_type surveyed?
Answer
surveys_complete %>%
     count(plot_type) 


Challenge - frequencies

  1. What is the frequency of each species of each sex observed? Sort each species in descending order of frequency.
Answer
surveys_complete %>%
  count(sex, species) %>%
  arrange(species, desc(n))


Plotting histograms

If you would like to plot the distribution of a single continuous variable the frequency will be automatically calculated, so you do not need to use count() to calculate the frequency beforehand. The x-axis is automatically divided into bins and the number of observations of the continuous variable in each bin is shown as a bar in the histogram. In the ggplot2 package a histogram can be plotted using the geom_histogram function.

Let us plot a histogram for the continuous variable weight:

ggplot(surveys_complete, aes(weight)) +
  geom_histogram()

You can identify display categories in the histogram by plotting a stacked histogram which will show categories for each group stacked on top of each other. This is done by using the fill argument in the aesthetic function. If we want to display sex in our weight histogram:

ggplot(surveys_complete, aes(weight, fill = sex)) +
  geom_histogram(bins = 100)

Note that the default number of bins in a histogram is 30. To get have a granular display you can increase the number of bins by using the argument bins in the geom_histogram function as above.

There are other plots that can be used for a single continuous variable (see ONE VARIABLE continuous section on ggplot2 cheat sheet).

Summarising data with summarise()

When working with continuous variables, one of the most popular summary statistic is the mean. If we try to caclulate the mean on weight in the surveys_complete dataset we get an error:

surveys_complete %>%
  mean_weight = mean(weight)
#> Error in eval(expr, envir, enclos): object 'weight' not found

This is because in dplyr you will need to use the summarise function to be able to create summaries of your data.

The summarise function is used when you want to reduce multiple values in your columns to a single value. If we try to calculate the mean using summarise() this time:

surveys_complete %>%
  summarise(mean_weight = mean(weight))
#> # A tibble: 1 × 1
#>   mean_weight
#>         <dbl>
#> 1        41.8

As seen from the result, the mean of the values present in the weight column is 41.8. This value is the mean of weight of the entire data set. We already know that there are many different groups within the data set, so this is not very informative. It would therefore be good to calculate mean values for groups within the data. We cover how to do this next.

Grouping data

In the examples above we learned how to summarise data over all observations, e.g., we calculated the mean over all observations using the summarise function. However, in data analysis, especially when dealing with big data, a common approach to data exploration is the split-apply-combine strategy. The idea behind this strategy is to split the data into more manageable pieces, apply any operations required on the data independently on each piece and then combine the results together. The figure below illustrates the approach that is done in the split-apply-combine approach.

Let’s work on an example on how we can apply the split-apply-combine strategy on the surveys_complete dataset. We would like to split the data by the different categories present in the sex column and calculate the mean weight for each category. We can do this easily by using the group_by function in the dplyr package:

surveys_complete %>%
  group_by(sex) %>%
  summarise(mean_weight = mean(weight))
#> # A tibble: 2 × 2
#>   sex   mean_weight
#>   <chr>       <dbl>
#> 1 F            41.5
#> 2 M            42.1

You can also group by multiple columns:

surveys_complete %>%
  group_by(sex, species_id) %>%
  summarise(mean_weight = mean(weight))
#> `summarise()` has grouped output by
#> 'sex'. You can override using the
#> `.groups` argument.
#> # A tibble: 46 × 3
#> # Groups:   sex [2]
#>    sex   species_id mean_weight
#>    <chr> <chr>            <dbl>
#>  1 F     BA                9.16
#>  2 F     DM               41.6 
#>  3 F     DO               48.5 
#>  4 F     DS              117.  
#>  5 F     NL              154.  
#>  6 F     OL               30.8 
#>  7 F     OT               24.8 
#>  8 F     OX               22   
#>  9 F     PB               30.2 
#> 10 F     PE               22.8 
#> # ℹ 36 more rows

Once the data are grouped, you can also summarise multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum weight for each species for each sex:

surveys_complete %>%
  group_by(sex, species_id) %>%
  summarise(mean_weight = mean(weight),
            min_weight = min(weight))
#> `summarise()` has grouped output by
#> 'sex'. You can override using the
#> `.groups` argument.
#> # A tibble: 46 × 4
#> # Groups:   sex [2]
#>    sex   species_id mean_weight min_weight
#>    <chr> <chr>            <dbl>      <dbl>
#>  1 F     BA                9.16          6
#>  2 F     DM               41.6          10
#>  3 F     DO               48.5          12
#>  4 F     DS              117.           45
#>  5 F     NL              154.           32
#>  6 F     OL               30.8          10
#>  7 F     OT               24.8           5
#>  8 F     OX               22            22
#>  9 F     PB               30.2          12
#> 10 F     PE               22.8          11
#> # ℹ 36 more rows

Challenge - grouping & summarising

Use group_by() and summarise() to find the mean, min, and max hindfoot length for each species (using species_id). Also add the number of observations in a column called n_obs (hint: see ?n).

Answer
surveys_complete %>%
    group_by(species_id) %>%
    summarise(
        mean_hindfoot_length = mean(hindfoot_length),
        min_hindfoot_length = min(hindfoot_length),
        max_hindfoot_length = max(hindfoot_length),
        n_obs = n()
    )
#> # A tibble: 24 × 5
#>    species_id mean_hindfoot_length min_hindfoot_length max_hindfoot_length n_obs
#>    <chr>                     <dbl>               <dbl>               <dbl> <int>
#>  1 BA                         13                     6                  16    45
#>  2 DM                         36.0                  16                  50  9702
#>  3 DO                         35.6                  26                  64  2781
#>  4 DS                         50.0                  39                  58  2023
#>  5 NL                         32.2                  21                  42  1045
#>  6 OL                         20.5                  12                  39   905
#>  7 OT                         20.3                  13                  50  2062
#>  8 OX                         20.4                  19                  21     5
#>  9 PB                         26.1                   2                  47  2762
#> 10 PE                         20.2                  11                  30  1197
#> # ℹ 14 more rows


Challenge - grouping and sorting

What was the heaviest animal measured in each year? Return the columns year, genus, species, and weight.

Answer
surveys_complete %>%
    group_by(year) %>%
    filter(weight == max(weight)) %>%
    select(year, genus, species_id, weight) %>%
    arrange(year)
#> # A tibble: 28 × 4
#> # Groups:   year [26]
#>     year genus     species_id weight
#>    <dbl> <chr>     <chr>       <dbl>
#>  1  1977 Dipodomys DS            149
#>  2  1978 Neotoma   NL            232
#>  3  1978 Neotoma   NL            232
#>  4  1979 Neotoma   NL            274
#>  5  1980 Neotoma   NL            241
#>  6  1980 Neotoma   NL            241
#>  7  1981 Neotoma   NL            251
#>  8  1982 Neotoma   NL            252
#>  9  1983 Neotoma   NL            256
#> 10  1984 Neotoma   NL            259
#> # ℹ 18 more rows


Exporting/Writing data to files

Now that you have learned how to use dplyr to transform your raw data, you may want to export these new datasets to share them with your collaborators or for archival.

Similar to the read_csv function used for reading CSV files into R, there is a write_csv function that generates CSV files from data frames and tibbles which is also present in the readr package.

Before using write_csv(), note that we are going to write the data to the data_processed folder. This is to keep these data separate from our raw data in data_raw.

Let’s save the surveys_complete tibble in data_processed/surveys_complete.csv file:

write_csv(surveys_complete, "data_processed/surveys_complete.csv")

Additional information

The topics we have covered so far are just a small fraction of what is possible in R. Below is information on some more advanced topics of data wrangling and plotting. The topics and challenges are hopefully self-explanatory!

Facetting

The ggplot2 package has a way of creating different plots based on the different categories in the data. This is known as facetting. With facetting we do not need to use group_by() to split the data into different groups to be able to plot the different categories in different plots as ggplot2 does this automatically.

There are two types of facet functions:

  • facet_wrap() arranges the different plots into muliple rows and columns to cleanly fit on one page.
  • facet_grid() plots all the categories in 1 row or 1 column.

Let us see this in action.

surveys_complete %>% 
  group_by(year, genus) %>% 
  count() %>% 

ggplot(mapping = aes(x = year, y = n)) +
    geom_line() +
    facet_wrap(facets = vars(genus))

As you can see, each genus has been plotted as a separate plot. It is now clear which are the genera that were observed the most. Another advantage of facetting is that it uses a common axes and all plots are aligned to the same values on the axes, making the different plots comparable. If you want to have different axes for each plot you can do so by using the scales argument.

surveys_complete %>% 
  group_by(year, genus) %>% 
  count() %>% 

ggplot(mapping = aes(x = year, y = n)) +
    geom_line() +
    facet_wrap(facets = vars(genus), scales = "free")

The pattern of the graphs that before were hardly visible, e.g., Baiomys, is now clear as the axes have been rescaled to fit the data. This is the main advantage of using free scales. The disadvantage is that the different plots are not comparable as before.

If we would like to see if there is any difference between the sex, we can do this by adding sex as another grouping to count(). Let’s assign it to an object yearly_sex_counts.

yearly_sex_counts <- surveys_complete %>%
                      count(year, genus, sex)

yearly_sex_counts will now look like:

#> # A tibble: 389 × 4
#>     year genus           sex       n
#>    <dbl> <chr>           <chr> <int>
#>  1  1977 Chaetodipus     F         3
#>  2  1977 Dipodomys       F       103
#>  3  1977 Dipodomys       M       119
#>  4  1977 Onychomys       F         2
#>  5  1977 Onychomys       M         1
#>  6  1977 Perognathus     F        14
#>  7  1977 Perognathus     M         8
#>  8  1977 Peromyscus      M         2
#>  9  1977 Reithrodontomys F         1
#> 10  1977 Reithrodontomys M         1
#> # ℹ 379 more rows

This should now allow us to also split by sex. We can use colour to distinguish between the sex categories:

ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, colour = sex)) +
  geom_line() +
  facet_wrap(facets = vars(genus))

Let us do the same thing with facet_grid() so that we can understand the difference between the two facetting techniques in the ggplot2 package. With facet_grid() you specify what variable you would like to split on as in the rows or cols arguments:

surveys_complete %>% 
  group_by(year, genus) %>% 
  count() %>% 
  
ggplot(mapping = aes(x = year, y = n)) +
    geom_line() +
    # Display the genera as columns
    facet_grid(cols = vars(genus))

As you can see facet_grid() placed all the categories of genus in 1 row, unlike facet_wrap() which have spread them over multiple rows to fit well in 1 page. Let us split the plots by sex as well by plotting sex as the rows:

ggplot(data = yearly_sex_counts, 
  mapping = aes(x = year, y = n)) +
  geom_line() +
  facet_grid(rows = vars(sex), cols = vars(genus))

More information on further functionality of facetting can be found in the facet_wrap() and facet_grid() documentation.

Challenge

Instead of splitting the plots based on sex display the sex as different coloured line graphs in the same plot.

ggplot(data = yearly_sex_counts, 
 mapping = aes(x = year, y = n, colour = sex)) +
 geom_line() +
 facet_grid(cols = vars(genus))


Reshaping data - pivot

In the data exploration workflow we discussed how to structure our data leading to the four rules defining a tidy dataset:

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value must have its own cell
  4. Each type of observational unit forms a table

Here we examine the fourth rule: Each type of observational unit forms a table.

In surveys , the rows of surveys contain the values of variables associated with each record (the unit), values such as the weight or sex of each animal associated with each record. What if instead of comparing records, we wanted to compare the different mean weight of each genus between plots? (Ignoring plot_type for simplicity).

We’d need to create a new table where each row (the unit) is comprised of values of variables associated with each plot. In practical terms this means the values in genus would become the names of column variables and the cells would contain the values of the mean weight observed on each plot.

Having created a new table, it is therefore straightforward to explore the relationship between the weight of different genera within, and between, the plots. The key point here is that we are still following a tidy data structure, but we have reshaped the data according to the observations of interest: average genus weight per plot instead of recordings per date.

The opposite transformation would be to transform column names into values of a variable.

We can do both these of transformations with two tidyr functions, pivot_wider() and pivot_longer().

Widening data with pivot_wider()

pivot_wider() takes three principal arguments:

  1. the data
  2. the names_from column variable whose values will become new column names.
  3. the values_from column variable whose values will fill the new column variables.

Further arguments include values_fill which, if set, fills in missing values with the value provided.

Let’s use pivot_wider() to transform surveys to find the mean weight of each genus in each plot over the entire survey period. We use filter(), group_by() and summarise() to filter our observations and variables of interest, and create a new variable for the mean_weight. We use the pipe as before too.

surveys_gw <- surveys %>%
  filter(!is.na(weight)) %>%
  group_by(plot_id, genus) %>%
  summarise(mean_weight = mean(weight))
#> `summarise()` has grouped output by
#> 'plot_id'. You can override using the
#> `.groups` argument.
str(surveys_gw)

This yields surveys_gw where the observations for each plot are spread across multiple rows, 196 observations of 3 variables. Using pivot_wider() to key on genus with values from mean_weight this becomes 24 observations of 11 variables, one row for each plot. We again use pipes:

surveys_wider <- surveys_gw %>%
  pivot_wider(names_from = genus, values_from = mean_weight)
str(surveys_wider)

We could now plot comparisons between the weight of genera in different plots, although we may wish to fill in the missing values first, using the values_fill argument. Note: be careful when replacing NA with zero, since a mean_weight of would be recognised as an actual measurement, and not as a missing value!

surveys_gw %>%
  pivot_wider(
    names_from = genus,
    values_from = mean_weight,
    values_fill = list(mean_weight = 0)
    )

Lengthening data with pivot_longer

The opposing situation could occur if we had been provided with data in the form of surveys_wider, where the genus names are column names, but we wish to treat them as values of a genus variable instead.

In this situation we are lenghtening the data by taking the column names and turning them into a pair of new variables. One variable represents the column names as values, and the other variable contains the values previously associated with the column names.

pivot_longer() takes four principal arguments:

  1. the data
  2. the cols containing the names of the columns we want to pivot into a longer format
  3. the names_to containing a string specifying the name of the column to create from the data stored in the column names of data
  4. the values_tocontaining a string specifying the name of the column to create from the data stored in cell values.

To recreate surveys_gw from surveys_wider we would create a column using names_to called genus. We would also need to create a column called mean_weight using the values_to argument. Lastly, we would use all columns apart from plot_id to pivot into a longer format. Here we drop plot_id column with a minus sign.

surveys_longer <- surveys_wider %>%
  pivot_longer(
    cols = -plot_id,
    names_to = "genus",
    values_to = "mean_weight")
str(surveys_longer)

Note that now the NA genera are included in the longer format. Making data wider and then longer can be a useful way to balance out a dataset so every replicate has the same composition.

We could also have used a specification for what columns to include. This can be useful if you have a large number of identifying columns, and it’s easier to specify what to gather than what to leave alone. And if the columns are in a row, we don’t even need to list them all out - just use the : operator!

surveys_wider %>%
  pivot_longer(
    cols = Baiomys:Spermophilus,
    names_to = "genus",
    values_to = "mean_weight") %>%
  head()

Challenge

  1. Take the surveys data frame and make it wider, using year as columns, plot_id as rows, and the number of genera per plot as the values. You will need to summarise before reshaping, and use the function n_distinct() to get the number of unique genera within a particular chunk of data. It’s a powerful function! See ?n_distinct for more.
surveys_wider_genera <- surveys %>%
  group_by(plot_id, year) %>%
  summarise(n_genera = n_distinct(genus)) %>%
  pivot_wider(names_from = year, values_from = n_genera)
#> `summarise()` has grouped output by
#> 'plot_id'. You can override using the
#> `.groups` argument.
head(surveys_wider_genera)
  1. Now take that data frame and pivot_longer() it again, so each row is a unique plot_id by year combination.
surveys_wider_genera %>%
  pivot_longer(
  cols = -plot_id,
  names_to = "year",
  values_to = "n_genera")
  1. The surveys data set has two measurement columns: hindfoot_length and weight. This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types. Let’s walk through a common solution for this type of problem. First, use pivot_longer() to create a dataset where we have a key column called measurement and a value column that takes on the value of either hindfoot_length or weight. Hint: You’ll need to specify which columns are being pivoted.
surveys_long <- surveys %>%
  pivot_longer(
  cols = c(hindfoot_length, weight),
  names_to = "measurement",
  values_to = "value")
  1. With this new data set, calculate the average of each measurement in each year for each different plot_type. Then make the data set wider using pivot_wider() with a column for hindfoot_length and weight. Hint: You only need to specify the name and value columns for pivot_wider().
surveys_long %>%
  group_by(year, measurement, plot_type) %>%
  summarise(mean_value = mean(value, na.rm = TRUE)) %>%
  pivot_wider(names_from = measurement, values_from = mean_value)
#> `summarise()` has grouped output by
#> 'year', 'measurement'. You can override
#> using the `.groups` argument.


Reshaping data - joining tables

Often data is spread out over multiple tables, instead of in one large, single table. Why collect data in multiple tables? - Not efficient to include all information in a single table. - Redundant information makes it more difficult to update or revise data. - Make changes in one place, not hundreds of places. - Use multiple tables - Each table contains a single kind of information, for example in our case we could split the data into three tables: - surveys: information about individuals - species: information about species - plots: information about plots - If a species name changes we only need to change it in the species table - Connect tables using joins to describe relationships between tables

To illustrate the joining of different tables we will create two tables: hindfoot_length_join and weight_join, which contain the average hindfoot_length and weight per genus, respectively. To simplify the example, we are only using three genera, two of which are common between the tables. These are taken using the slice function. For more information, see ?slice.

# Create a summary table for mean hindfoot_length
hindfoot_length_join <- surveys %>%
  group_by(genus) %>% 
  summarise(hindfoot_length = mean(hindfoot_length, na.rm = TRUE)) %>% 
  slice(4,8,11)

# Create a summary table for mean weight
weight_join <- surveys %>%
  group_by(genus) %>% 
  summarise(weight = mean(weight, na.rm = TRUE)) %>% 
  slice(4,8,15)

hindfoot_length_join
#> # A tibble: 3 × 2
#>   genus       hindfoot_length
#>   <chr>                 <dbl>
#> 1 Baiomys                13  
#> 2 Chaetodipus            23.9
#> 3 Dipodomys              37.9
weight_join
#> # A tibble: 3 × 2
#>   genus       weight
#>   <chr>        <dbl>
#> 1 Baiomys       8.6 
#> 2 Chaetodipus  24.2 
#> 3 Perognathus   8.38

As can be seen above, the Baiomys and Chaetodipus genera are common between the two tables. Dipodomys only occurs in hindfoot_length_join and Perognathus only in weight_join.

Joining of tables is done using the mutating join functions in dplyr documentation. All of join functions take three arguments, two table names (the tables to join, the left-hand side and right-hand side) and a common key defined in by =. Here we look at the four most commonly used join functions, which are illustrated in the image below.

  1. left-join. This joins matching rows from the right-hand side table to the left-hand side table.
  2. right_join. This joins matching rows from the left-hand side table to the right-hand side table.
  3. inner_join. This retains only rows that occur in both tables.
  4. full_join. This retains all values from all rows.

In this case we want to join the data by genus, so we use by = genus as the common key.


Customisation of plots

Though the default visualisation of ggplot2 plots is already at a good standard, there are several ways one can improve even further the visualisations.

Labels

Let us start customising the last plot we have plotted by renaming the axes and adding a title to the plot. This is done by using the labs function:

ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n)) +
  geom_line() +
  facet_grid(cols = vars(genus)) +
  labs(title = "Observed genera through time",
       x = "Year of observation",
       y = "Number of animals")

The major item that needs fixing in the plot is the text on the x-axis as this crammed and is not readable at the moment. This is mainly due to the fact that the size of the plot is dependent on the size of the window (in this case RStudio). You can work around this by saving your plot to a file and specifying the width of the plot (see Saving a plot to a file section). Themes in the ggplot2 package control the display of all non-data elements of the plot. Let us start customising the text on the x-axis by changing its size and position using the theme function. Note that theme() has several other arguments and you can read more about them in the theme() documentation.

ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, colour = sex)) +
    geom_line() +
    facet_grid(cols = vars(genus)) +
    labs(title = "Observed genera through time",
         x = "Year of observation",
         y = "Number of animals")  +
    theme(axis.text.x = element_text(size = 7, angle = 90, vjust = 0.5), 
          axis.text.y = element_text(size = 7), 
          strip.text=element_text(size = 7, angle = 45))

Legend

With the plot already looking better, the last thing we would like to change is the legend. Legends are very tricky in ggplot2 as the fuction to use is determined by the data that is being displayed. In this case the legend has been created based on colour groupings. Therefore we can change the lengend title, categories and colour as follows:

ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, colour = sex)) +
    geom_line() +
    facet_grid(cols = vars(genus)) +
    labs(title = "Observed genera through time",
         x = "Year of observation",
         y = "Number of animals")  +
    theme(axis.text.x = element_text(size = 7, angle = 90, vjust = 0.5), 
          axis.text.y = element_text(size = 7), 
          strip.text = element_text(size = 7, angle = 45)) +
    scale_colour_brewer("Sex", 
                       palette = "Set1", 
                       breaks = c("F", "M"), 
                       labels = c("Female", "Male"))

Note: If you would like to see what other palettes are available please see http://colourbrewer2.org/#type=qualitative&scheme=Set1&n=3.

Themes

ggplot2 has a set of themes that can be used to change the overall appearance of the graph without much effort. For example, if we create the first plot again and apply the theme_bw() theme we get a more simpler white background:

ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point() +
  theme_bw()

A list of themes can be found in the ggplot2 documentation.

Challenge

Use what you just learned to create a plot that depicts how the average weight of each species changes through the years.

surveys_complete %>%
 group_by(year, species) %>%
 summarise(avg_weight = mean(weight)) %>%
 ggplot(mapping = aes(x = year, y = avg_weight)) +
   geom_line() +
   facet_wrap(facets = vars(species)) +
   theme_bw() +
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
#> `summarise()` has grouped output by
#> 'year'. You can override using the
#> `.groups` argument.